134 ◾ Bioinformatics
4.2.2.2.7 Mark the duplicate reads
GATK4 best practice recommends removing or marking the duplicate reads mapped to the
reference genome to avoid biases in variant calling. In the above examples, we used sam-
tools to mark the duplicate reads. With GATK4 best practice, we can use Picard to mark
the duplicate. Picard is a set of command-line tools for manipulating sequencing data such
as SAM/BAM and VCF. To install Picard, follow the instructions at “https://broadinsti-
tute.github.io/picard/”.
The Picard MarkDuplicates function works by comparing reads in the 5′ positions and
collecting the duplicates. The function identifies the primary and duplicate reads. The
duplicates are marked with the hexadecimal value of 0x0400. For more details, check out
Picard manual.
The following script creates a directory called “dedup”, marks the duplicate alignments
in BAM files and indexes the resulted BAM files, and stores them in the new directory:
mkdir dedup
cd chr21
for i in $(ls *.bam|rev|cut -c 5-|rev);
do
java -Xmx7g -jar ~/software/picard.jar MarkDuplicates \
-INPUT ${i}.bam \
-OUTPUT ../dedup/${i}.dedup.bam \
-METRICS_FILE ../dedup/${i}.dedup_metrics.txt
samtools index ../dedup/${i}.dedup.bam
done
cd ..
4.2.2.2.8 Adding read grouping header to each BAM file
RGs are identified in the BAM file by a number of tags that are defined according to a specific
format. We can use the following samtools command to display the RG tags on a BAM file:
samtools view -H filename.bam | grep “^@RG”
GATK4 RGs’ tags allow us to differentiate samples and technical features that are asso-
ciated with artifacts. Such information is used to reduce the effects of artifacts during the
duplicate marking and base recalibration steps. GATK requires several RG fields to be pres-
ent in input files and will fail with errors if this requirement is not met. The RG fields are
usually set when the reads are aligned to a reference genome by the aligner. However, if the
BAM files do not include the RG fields, we can use the “AddOrReplaceReadGroups” Picard
function to add or modify the RG fields as we will do next. AddOrReplaceReadGroups uses
the following arguments to add or modify the RG fields on a BAM file:
RGID: This adds group read ID, which must be unique across the samples.
RGPU: This argument is optional for GATK, and it holds the flow cell barcode, lane ID,
and sample barcode separated by a period “.”.
RGSM: This holds the sample name that will be used later as a header of sample column
in the VCF file.